Packet Traffic Learning
Proposal
Dataset
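import pandas as pd
# read_csv treats the first line as a header by default; pass header=None if the files have no header row.
network_traffic_train = pd.read_csv('data/KDDTrain+.txt')
network_traffic_test = pd.read_csv('data/KDDTest+.txt')
# Work on copies so the raw frames stay untouched.
df_train = network_traffic_train.copy()
df_test = network_traffic_test.copy()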
'''
Columns received from Kaggle project
https://www.kaggle.com/code/faizankhandeshmukh/intrusion-detection-system
'''
# Define the list of column names based on the NSL-KDD dataset description
columns = [
'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins',
'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root',
'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
'dst_host_srv_rerror_rate', 'attack', 'level'
]
# Assign the column names to the dataframe
df_train.columns = columns
df_test.columns = columns
print('Shapes (train, test):', df_train.shape, df_test.shape)
Shapes (train, test): (125972, 43) (22543, 43)
We use the NSL-KDD network intrusion detection dataset, obtained from Kaggle, which ships as separate training and test files. As loaded above, the training set contains 125,972 rows and 43 columns, and the test set contains 22,543 rows and 43 columns.
The attack field labels each observation as normal or as a specific attack type (multi-class), which lets us use learning approaches to classify anomalous network activity. A new binary feature, is_anomalous, will be added to indicate whether the network connection was anomalous. This will be the target field for the project.
We chose this dataset because it provides a rich and realistic representation of network traffic data. The presence of labeled data allows us to train and evaluate supervised models; the diversity and volume of traffic patterns make it well-suited for exploring unsupervised anomaly detection techniques as well. This balance between complexity and feature richness aligns well with our research questions and modeling goals.
Questions
Q1. Using supervised machine learning models such as Long Short-Term Memory (LSTM) networks and Support Vector Machines (SVMs), can we accurately classify network traffic as normal or anomalous based on labeled data? How do the models compare in terms of accuracy, precision, recall, and F1-score?
Q2. Can unsupervised learning methods such as K-Means Clustering and Density-Based Clustering (DBSCAN) detect anomalous patterns in network traffic without using labeled data?
Summary. How do the supervised and unsupervised approaches compare?
Dataset Analysis
Variables
| Column Name | Data Type | Description |
|---|---|---|
| duration | int64 | Length (in seconds) of the connection. |
| protocol_type | object | Protocol used (e.g., tcp, udp, icmp). |
| service | object | Network service on the destination (e.g., http, telnet). |
| flag | object | Status flag of the connection. |
| src_bytes | int64 | Number of data bytes sent from source to destination. |
| dst_bytes | int64 | Number of data bytes sent from destination to source. |
| land | int64 | 1 if connection is from/to the same host/port; 0 otherwise. |
| wrong_fragment | int64 | Number of wrong fragments. |
| urgent | int64 | Number of urgent packets. |
| hot | int64 | Number of "hot" indicators. |
| num_failed_logins | int64 | Number of failed login attempts. |
| logged_in | int64 | 1 if successfully logged in; 0 otherwise. |
| num_compromised | int64 | Number of compromised conditions. |
| root_shell | int64 | 1 if root shell is obtained; 0 otherwise. |
| su_attempted | int64 | 1 if "su root" command attempted; 0 otherwise. |
| num_root | int64 | Number of "root" accesses. |
| num_file_creations | int64 | Number of file creation operations. |
| num_shells | int64 | Number of shell prompts invoked. |
| num_access_files | int64 | Number of accesses to control files. |
| num_outbound_cmds | int64 | Number of outbound commands (always 0 in KDD99). |
| is_host_login | int64 | 1 if login is to a host account; 0 otherwise. |
| is_guest_login | int64 | 1 if login is to a guest account; 0 otherwise. |
| count | int64 | Number of connections to the same host in the past 2 seconds. |
| srv_count | int64 | Number of connections to the same service in the past 2 seconds. |
| serror_rate | float64 | % of connections with SYN errors. |
| srv_serror_rate | float64 | % of connections to the same service with SYN errors. |
| rerror_rate | float64 | % of connections with REJ errors. |
| srv_rerror_rate | float64 | % of connections to the same service with REJ errors. |
| same_srv_rate | float64 | % of connections to the same service. |
| diff_srv_rate | float64 | % of connections to different services. |
| srv_diff_host_rate | float64 | % of connections to different hosts on the same service. |
| dst_host_count | int64 | Number of connections to the destination host. |
| dst_host_srv_count | int64 | Number of connections to the destination host and service. |
| dst_host_same_srv_rate | float64 | % of connections to the same service on the destination host. |
| dst_host_diff_srv_rate | float64 | % of connections to different services on the destination host. |
| dst_host_same_src_port_rate | float64 | % of connections from the same source port. |
| dst_host_srv_diff_host_rate | float64 | % of connections to the same service from different hosts. |
| dst_host_serror_rate | float64 | % of connections with SYN errors to the destination host. |
| dst_host_srv_serror_rate | float64 | % of connections with SYN errors to the destination service. |
| dst_host_rerror_rate | float64 | % of connections with REJ errors to the destination host. |
| dst_host_srv_rerror_rate | float64 | % of connections with REJ errors to the destination service. |
| attack | object | Label indicating the type of attack or "normal". |
| level | int64 | Severity or confidence score of the attack (if available). |
Exploratory Data Analysis
We take a quick look at the training data to see if there are any obvious imbalances.
print("Shape:", df_train.shape)
print("Missing values:", df_train.isna().sum().sum())
print("Duplicates:", df_train.duplicated().sum())
print("Unique attack labels:", df_train['attack'].nunique())
print("Attack label distribution:\n", df_train['attack'].value_counts().head(5))
# Show types and non-null counts
df_train.info(verbose=False)
Shape: (125972, 43)
Missing values: 0
Duplicates: 0
Unique attack labels: 23
Attack label distribution:
attack
normal 67342
neptune 41214
satan 3633
ipsweep 3599
portsweep 2931
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125972 entries, 0 to 125971
Columns: 43 entries, duration to level
dtypes: float64(15), int64(24), object(4)
memory usage: 41.3+ MB
This first look at the data is encouraging. There are plenty of data points and features. Depending on how long the models take to fit, we may reduce the sample size, because hyperparameter tuning with GridSearchCV can be time-intensive. The data contain no missing values, and the glimpse of the attack column, where two labels account for most records, shows why we want to collapse it into a binary column.
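One way to keep GridSearchCV affordable is to tune on a stratified subsample. The sketch below is illustrative only: it uses synthetic data from make_classification in place of the NSL-KDD features, and the 20% sample size, the SVC estimator, and the parameter grid are all assumptions rather than choices made in this proposal.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the training features and binary target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Keep a 20% stratified subsample so the class proportions are preserved.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)

# Hypothetical grid; the real search space would be chosen during modeling.
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring='f1')
search.fit(X_small, y_small)

print('Subsample size:', len(X_small))
print('Best parameters:', search.best_params_)
```

Passing stratify=y matters here because a plain random subsample could shift the normal/attack ratio on a smaller sample.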
Normal vs Anomalous Traffic
First, we look at the counts of normal versus anomalous traffic.
attack
normal 67342
neptune 41214
satan 3633
ipsweep 3599
portsweep 2931
smurf 2646
nmap 1493
back 956
teardrop 892
warezclient 890
pod 201
guess_passwd 53
buffer_overflow 30
warezmaster 20
land 18
imap 11
rootkit 10
loadmodule 9
ftp_write 8
multihop 7
phf 4
perl 3
spy 2
Name: count, dtype: int64
The plot shows which specific attack types appear in the data and how unevenly they are distributed: a handful of labels dominate, while many have fewer than 100 records. This is why it makes sense to group all non-normal traffic together.
We engineer a new column, is_anomalous, which contains 0 if the connection is normal and 1 otherwise.
Next, we examine is_anomalous to get an idea of the target frequency.
| is_anomalous | Count | Percentage |
|---|---|---|
| Normal | 67342 | 53.46 |
| Attack | 58630 | 46.54 |
The is_anomalous classification target shows a near-even class distribution (53.46% normal, 46.54% attack), indicating that the dataset is well balanced. There should be no need for resampling or class weighting, so the dataset appears to be a good candidate for the learning models.
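The binary target and its frequency table could be produced along the following lines. This is a minimal sketch on a five-row toy frame (the variable name toy and its values are made up); only the attack column name and the 0/1 encoding come from the text.

```python
import pandas as pd

# Toy stand-in for df_train; only the 'attack' column matters here.
toy = pd.DataFrame({'attack': ['normal', 'neptune', 'normal', 'satan', 'normal']})

# 0 for normal traffic, 1 for every attack label.
toy['is_anomalous'] = (toy['attack'] != 'normal').astype(int)

# Count and percentage per class, in the same layout as the table above.
counts = toy['is_anomalous'].value_counts().sort_index()
summary = pd.DataFrame({
    'Count': counts,
    'Percentage': (counts / len(toy) * 100).round(2),
})
summary.index = summary.index.map({0: 'Normal', 1: 'Attack'})
print(summary)
```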
Analysis Plan
Problem Introduction
The project is to build and evaluate models capable of detecting anomalous network traffic based on connection-level features from the NSL-KDD dataset. The problem is framed as a binary classification task, where each record is labeled as either normal or anomalous. This has real-world applications in intrusion detection systems and network security monitoring.
The project will explore both supervised and unsupervised machine learning techniques to assess their effectiveness in identifying attacks from structured network traffic data.
Feature Engineering Strategy
To ensure a fair and consistent comparison, we will apply the same feature engineering pipeline to both supervised and unsupervised models. All features will be assigned appropriate column names based on the NSL-KDD documentation. Categorical variables such as protocol_type, service, and flag will be one-hot encoded, and low-variance or non-informative columns will be removed. Numeric features will be standardized using a scaler to normalize their ranges.
For supervised models, these engineered features will be used alongside the binary target is_anomalous. For unsupervised models, the same processed features will be used without labels, allowing the models to explore underlying structure or detect anomalous patterns. This consistent preprocessing ensures that differences in performance can be attributed to the modeling approaches rather than inconsistencies in data preparation.
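A shared preprocessing pipeline of this kind might look like the sketch below. It assumes scikit-learn's ColumnTransformer, OneHotEncoder, StandardScaler, and VarianceThreshold as the building blocks, which the proposal does not specify, and it runs on a four-row toy frame with made-up values rather than the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a few NSL-KDD column names; the values are made up.
toy = pd.DataFrame({
    'protocol_type': ['tcp', 'udp', 'tcp', 'icmp'],
    'service': ['http', 'domain_u', 'ftp', 'ecr_i'],
    'flag': ['SF', 'SF', 'S0', 'SF'],
    'duration': [0, 2, 0, 10],
    'src_bytes': [181, 105, 0, 1032],
    'num_outbound_cmds': [0, 0, 0, 0],  # constant, so it carries no information
})

categorical = ['protocol_type', 'service', 'flag']
numeric = [c for c in toy.columns if c not in categorical]

preprocess = Pipeline([
    # One-hot encode the categorical columns and standardize the numeric ones.
    ('encode_scale', ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
         ('num', StandardScaler(), numeric)],
        sparse_threshold=0.0,  # always return a dense array
    )),
    # Drop columns whose variance is zero after encoding and scaling.
    ('drop_constant', VarianceThreshold(threshold=0.0)),
])

X = preprocess.fit_transform(toy)
print(X.shape)
```

Fitting the pipeline on the training set only and then calling transform on the test set would keep test-set statistics out of the scaler and encoder.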
Dimensionality Reduction
Our dataset currently has over 40 features, so we will apply feature reduction techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or the feature_importances_ attribute of a Random Forest classifier. We'll experiment to determine the best feature subset for our classification task.
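Two of the options named above can be sketched as follows. The data are synthetic (random features with a target that depends on only two of them), and the 95% variance threshold and 100 trees are illustrative assumptions, not settings chosen for the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled feature matrix: 200 rows, 10 features.
X = rng.normal(size=(200, 10))
# Hypothetical target that depends only on features 0 and 1.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Option 1: keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver='full')
X_pca = pca.fit_transform(X)
print('PCA components kept:', pca.n_components_)

# Option 2: rank the original features by Random Forest importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print('Top features by importance:', ranking[:3])
```

PCA is unsupervised and yields new combined axes, whereas the importance ranking uses the labels and keeps the original, interpretable columns.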
Q1. Supervised Learning
For the supervised learning portion of the project, we will train and evaluate models using the labeled NSL-KDD dataset to classify network traffic as normal or anomalous. Specifically, we will implement and compare a Long Short-Term Memory (LSTM) neural network and a Support Vector Machine (SVM). These models will be trained on the same feature-engineered data, using the is_anomalous column as the target. Model performance will be assessed using standard classification metrics, including accuracy, precision, recall, F1-score, and ROC AUC.
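The evaluation loop for the SVM could be sketched as below. It uses synthetic data from make_classification instead of the engineered NSL-KDD features, and untuned default hyperparameters; both are assumptions for illustration. The LSTM is left out because it would require a deep-learning framework that the proposal has not yet named.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data standing in for the engineered NSL-KDD features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hyperparameters here are scikit-learn defaults, not tuned values.
svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_score = svm.decision_function(X_test)  # continuous scores for ROC AUC

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred),
    'f1': f1_score(y_test, y_pred),
    'roc_auc': roc_auc_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

ROC AUC is computed from decision_function scores rather than hard predictions, since it needs a ranking of the test points.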
Q2. Unsupervised Learning
For the unsupervised learning portion of the project, we will explore clustering-based approaches to detect anomalies in network traffic without relying on labeled data. We plan to experiment with techniques such as K-Means and DBSCAN to group similar observations and identify outliers that may correspond to attacks. After clustering, we will evaluate how well the resulting groupings align with the true labels using appropriate metrics for unsupervised learning. This will help us assess the potential of unsupervised models to detect anomalous behavior in the absence of supervision.
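A rough sketch of the clustering step and its comparison against held-back labels follows. The two well-separated blobs, the DBSCAN eps and min_samples values, and the choice of adjusted Rand index and silhouette score as metrics are all assumptions for illustration; the proposal leaves the exact metrics open.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Two synthetic groups standing in for normal and anomalous traffic.
X, y_true = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                       cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# K-Means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN finds dense regions and labels sparse points as noise (-1).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# The labels are used only for evaluation, never for fitting.
print('K-Means ARI:', adjusted_rand_score(y_true, kmeans_labels))
print('K-Means silhouette:', silhouette_score(X, kmeans_labels))
print('DBSCAN ARI:', adjusted_rand_score(y_true, dbscan_labels))
print('DBSCAN noise points:', int(np.sum(dbscan_labels == -1)))
```

The adjusted Rand index is invariant to how cluster ids are numbered, so it can be compared directly with is_anomalous without first mapping clusters to classes.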
Summary Comparison
To compare the supervised and unsupervised approaches, we will evaluate their ability to correctly identify anomalous traffic using relevant metrics for each method. We will also consider practical factors such as interpretability, scalability, and the need for labeled data.
Project Timeline
| Task Name | Status | Due | Priority | Summary |
|---|---|---|---|---|
| Dataset exploration | In Progress | Week 1 | High | Load the dataset, inspect features, handle any preprocessing needs. |
| Define research questions | Complete | Week 1 | High | Clarify goals for supervised and unsupervised anomaly detection. |
| Supervised model development | Not Started | Week 2 | High | Train the LSTM and SVM classifiers described in Q1. |
| Evaluation of supervised models | Not Started | Week 3 | High | Use accuracy, precision, recall, F1, and ROC-AUC to assess performance. |
| Unsupervised model development | Not Started | Week 3 | Medium | Explore clustering methods such as K-Means and DBSCAN. |
| Evaluation of unsupervised models | Not Started | Week 4 | Medium | Compare anomaly scores to labeled data using precision-recall metrics. |
| Comparative analysis | Not Started | Week 4 | High | Analyze strengths and weaknesses of both approaches. |
| Final report & presentation | Not Started | Week 5 | High | Compile results, figures, and discussion into final deliverables. |